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Abstract 

Bayesian optimization has recently emerged as a popular and efficient tool 
for global optimization and hyperparameter tuning. Currently, the established 
Bayesian optimization practice requires a user-defined bounding box which is as¬ 
sumed to contain the optimizer. However, when little is known about the probed 
objective function, it can be difficult to prescribe such bounds. In this work we 
modify the standard Bayesian optimization framework in a principled way to al¬ 
low automatic resizing of the search space. We introduce two alternative methods 
and compare them on two common synthetic benchmarking test functions as well 
as the tasks of tuning the stochastic gradient descent optimizer of a multi-layered 
perceptron and a convolutional neural network on MNIST. 


1 Introduction 

Bayesian optimization has recently emerged as a powerful tool for the efficient global optimization 
of black box objective functions. Since the technique was introduced over 50 years ago, it has 
been applied for many different applications. Perhaps the most relevant application for the machine 
learning community is the automated tuning of algorithms ||2l[T3][T8l|2Tl[8l. However, the current 
state of the art requires the user to prescribe a bounding box within which to search for the optimum. 
Unfortunately, setting these bounds—often arbitrarily—is one of the main difficulties hindering 
the broader use of Bayesian optimization as a standard framework for hyperparameter tuning. For 
example, this obstacle was raised at the NIPS 2014 Workshop on Bayesian optimization as one of 
the open challenges in the field. 

In the present work, we propose two methods that are capable of growing the search space as the 
optimization progresses. The first is a simple heuristic which regularly doubles the volume of the 
search space, while the second is a regularization method that is practical and easy to implement in 
any existing Bayesian optimization toolbox based on Gaussian Process priors over objective func¬ 
tions. 

At a high level, the regularized method is born out of the observation that the only component of 
Bayesian optimization that currently requires a bounding box on the search space is the maximiza¬ 
tion of the acquisition function; this constraint is necessary because acquisition functions can have 
suprema at infinity. By using a non-stationary prior mean as a regularize^ we can exclude this 
possibility and use an unconstrained optimizer, removing the need for a bounding box. 

1.1 Related work 

Although the notion of using a non-trivial Gaussian process prior mean is not new, it is usually 
expected to encode domain expert knowledge or known structure in the response surface. Only 
one recent work, to the best of the authors’ knowledge, has considered using the prior mean as a 
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regularization term and it was primarily to avoid selecting points along boundaries and in corners of 
the bounding box m- 

In this work we demonstrate that regulization can be used to carry out Bayesian optimization without 
a rigid bounding box. We further introduce a volume doubling baseline, which we compare to the 
regularization approach. While the regularized algorithms exhibit a much more homogeneous search 
behaviour {i.e. boundaries and comers are not disproportionately favoured), the volume doubling 
baseline performs very well in practice. 

We begin with a brief review of Bayesian optimization with Gaussian processes in the next section, 
followed by an introduction to regularization via non-stationary prior means in Section]^ including 
visualizations that show that our proposed approaches indeed venture out of the initial user-defined 
bounding box. Sectionj^reports our results on two synthetic benchmarking problems and two neural 
network tuning experiments. 

2 Bayesian optimization 

Consider the global optimization problem of finding a global maximizer x,^ G argmax^gRd /(x) 
of an unknown objective function f : i-G M.. In this work, the function / is assumed to be a 

black-box for which we have no closed-form expression or gradient information. We further assume 
that / is expensive to evaluate so we wish to locate the best possible input x with a relatively 
small budget of N evaluations. Finally, the evaluations y G K of the objective function are noise- 
cormpted observations, and for the remainder of this work, we assume a Gaussian noise distribution, 
y ~ A/'(/(x), tj^). In contrast to typical Bayesian optimization settings, we do not assume the 
arg max to be restricted to a bounded subset A" C 

Bayesian optimization is a sequential model-based approach which involves (i) maintaining a prob¬ 
abilistic surrogate model over likely functions given observed data; and (ii) selecting future query 
points according to a selection policy, which leverages the uncertainty in the surrogate to nego¬ 
tiate exploration of the search space and exploitation of current modes. The selection policy is 
represented by an acquisition function K, where the subscript indicates the implicit 

dependence on the surrogate and, by extension, on the observed data = {(x^, More 

precisely, at iteration n: an input x„+i is selected by maximizing the acquisition function the 
black-box is queried and produces a noisy t/„+i; and the surrogate is updated in light of the new 
data point (x„+i, ?/„+i). Finally, after N queries the algorithm must make a final recommendation 
xjv which represents its best estimate of the optimizer. 

2.1 Gaussian processes 

In this work we prescribe a Gaussian process (GP) prior over functions na. When combined 
with a Gaussian likelihood, the posterior is also a GP and the Bayesian update can be computed 
analytically. Note that other surrogate models, such as random forests, have also been proposed in 
the model-based optimization literature M- 

A Gaussian process GP(/ro, k) is fully characterized by its prior mean function p,Q : ^ and its 

positive-definite kernel, or covariance, function fc : x i—R. Given any finite collectioij^of n 

points xi:„, the values of/(xi),..., f{x.n) are jointly Gaussian with mean m, where rrii := po(xi)^ 
and n X n covariance matrix K, where Kij = k{xi, xj )—hence the term covariance function. 

Given the Gaussian likelihood model, the vector of concatenated observations y = j/i:„ is also 
jointly Gaussian with covariance K + a^I. Therefore, at any arbitrary test location x, we can query 
our surrogate model (the GP) for the function value /(x) conditioned on observed data The 
quantity /(x) is a Gaussian random variable with the following mean /iri(x) and marginal variance 



/i„(x) =/xo(x)-f k(x)'^(K-f cr^I) ^(y-m) 
o’n W = k(x, x) - k(x)'^(K -f cr^Ij-^kjx), 


( 1 ) 

( 2 ) 


Here we use the convention ai:j = {ai,... ,aj}. 
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where k(x) is the vector of cross-covariance terms between x and Xi:„. Throughout this work we 
use the squared exponential kernel 


fcsE(x,x') = 6»oexp(-ir^), (3) 

where r = (x — x')^A~^(x — x') and A is a diagonal matrix of d length scales Oi-d and Oq is the 
kernel amplitude. Collectively referred to as the hyperparemeter, 6 = 0o:d parameterizes the kernel 
function k. When the noise variance is unknown, it can be added as a model hyperparameter as 
well. Similarly, the most common agnostic choice of prior mean is a constant bias /ro(x) = b which 
can also be added to 9. 


Hyperparameter marginalization. As in many regression tasks, the hyperparameter 9 must 
somehow be specified and has a dramatic effect on performance. Common tuning techniques such 
as cross-validation and maximum likelihood are either highly data-inefficient or run the risk of over¬ 
fitting. Recently, a Bayesian treatment of the hyperparameters via Markov chain Monte Carlo has 
become standard practice in Bayesian optimization US). Similarly in the present work, we specify 
an uninformative prior on 9 and approximately integrate it by sampling from the posterior p{9\'Dn) 
via slice sampling. 

2.2 Acquisition functions 

So far we have described the statistical model we use to represent our belief about the unknown 
objective / given data and how to update this belief given new observations. We have not 
described any mechanism or policy for selecting the query points xi:„. One could select these 
arbitrarily or by grid search but this would be wasteful; uniformly random search has also been 
proposed as a surprisingly good alternative to grid search IS. There is, however, a rich literature on 
selection strategies that utilize the surrogate model to guide the sequential search, i.e. the selection 
of the next query point x„+i given 

The key idea behind these strategies is to define an acquisition functions a : i—> M which quanti¬ 
fies the promis^of any point in the search space. The acquisition function is carefully designed to 
trade off exploration of the search space and exploitation of current promising neighborhoods. There 
are three common types of acquisition functions; improvement-based, optimistic, and information- 
based policies. 

The improvement-based acquisition functions, probability and expected improvement (PI and El, 
respectively), select the next point with the most probable and most expected improvement, re¬ 
spectively MM- On the other hand, the optimistic policy upper confidence bound (GP-UCB) 
measures, marginally for each test point x, how good the corresponding observation y will be in 
a low and fixed probability “good case scenario”—hence the optimism ll20ll . In contrast, there ex¬ 
ist information-based methods such as randomized probability matching, also known as Thompson 
sampling Gam a [mm, or the more recent entropy search methods EllElIhl. Thompson sam¬ 
pling selects the next point according to the distribution of the optimum x*, which is induced by the 
current posterior iiihiiaEi. Meanwhile, entropy search methods select the point x that is expected 
to provide the most information towards reducing uncertainty about x*. 

In this work we focus our attention on El, which is perhaps the most common acquisition function. 
With the GP surrogate model, El has the following analytic expression 

“n W = (/r„(x) -t)$ -f g„(x)^ ^ 

V CT„(x) J \ crn(x) J 

where $ and denote the standard normal cumulative distribution and density functions, respec¬ 
tively. Note however that the technique we outline in the next section can readily be extended to any 
Gaussian process derived acquisition function, including all those mentioned above. 


^The way “promise” is quantified depends on whether we care about cumulative losses or only the final 
recommendation Xiv. 
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3 Unbounded Bayesian optimization 

3.1 Volume doubling 

Our first proposed heuristic approach consists in expanding the search space regularly as the op¬ 
timization progresses, starting with an initial user-defined bounding box. This method otherwise 
follows the standard Bayesian optimization procedure and optimizes within the bounding box that is 
available at the given time step n. This approach requires two parameters: the number of iterations 
between expansions and the growth factor 7 . Naturally, to avoid growing the feasible space A" by a 
factor that is exponential in d, the growth factor applies to the volume of X. Finally, the expansion 
is isotropic about the centre of the domain. In this work, we double (7 = 2) the volume every id 
evaluations (only after an initial latin hypercube sampling of id points). 

3.2 Regularizing improvement policies 

We motivate the regularized approach by considering improvement policies {e.g. El); however, in the 
next section we show that this proposed approach can be applied more generally to all GP-derived 
acquisition functions. Improvement policies are a popular class of acquisition functions that rely on 
the improvement function 

-^(x) = (/(x) - t)I[/(x) > r] (5) 

where t is some target value to improve upon. Expected improvement then compute E[J(x)] under 
the posterior GP. 

When the optimal objective value /* is known and we set r = /*, these algorithms are referred 
to as goal seeking ca. When the optimum is not known, it is common to use a proxy for /* 
instead, such as the value of the best observation so far, j/+; or in the noisy setting, one can use 
either the maximum of the posterior mean or the value of the posterior at the incumbent x+, where 

x+ G argmaXxgxi:„ Mn(x). 

In some cases, the above choice of target t = can lead to a lack of exploration, therefore it 
is common to choose a minimum improvement parameter ^ > 0 such that r = (1 - 1 - (for 
convenience here we assume is positive and in practice one can subtract the overall mean to 
make it so). Intuitively, the parameter ^ allows us to require a minimum fractional improvement 
over the current best observation ?/+. Previously, the parameter ^ had always been chosen to be 
constant if not zero. In this work we propose to use a function ^ IR+ which maps points in 

the space to a value of fractional minimum improvement. Poliowing the same intuition, the function 
^ lets us require larger expected improvements from points that are farther and hence acts as a 
regularizer that penalizes distant points. The improvement function hence becomes: 

f(x) = (/(x) - t(x))I[/(x) > t(x)], with t(x) = (1 -h ^(x))j/+, (6) 

where the choice of ^(x) is discussed in Section [3.2.2 

3.2.1 Extension to general policies 

In the formulation of the previous section, our method seems restricted to improvement policies. 
However, many recent acquisition functions of interest are not improvement-based, such as GP- 
UCB, entropy search, and Thompson sampling. In this section, we describe a closely related formu¬ 
lation that generalizes to all acquisition functions that are derived from a GP surrogate model. 

Consider expanding our choice of non-stationary target r in Equation (|^ 

/(x) = (/(x) - j/+(l -h C(x)))I[/(x) > J/+(l -h C(x))] 

= (/(x) - y+C(x) - 2 /+)I[/(x) - J/+C(x) > y+] 

= (/(x) - ?/+)I[/(x) > 2 /+] (7) 

where / is the posterior mean of a GP from o with prior mean /io(x) = /ro(x) — j/+^(x). Notice 
the similarity between Q and Q. Indeed, in its current form we see that the regularization can be 
achieved simply by using a different prior mean po and a constant target 2 /+. This duality can be 
visualized when comparing the left and right panel of Pigure [2 
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x-fi x+R y^+R *^ 3 ’*' *+ 3 '^ 

(a) Minimum improvement view (b) Prior mean view 

Figure 1: Visualization of the two alternate views of regularization in Bayesian optimization. The objective 
function and posterior surrogate mean are represented as black dashed and solid lines, respectively, with grey 
shaded areas indicating ±2 ct„. Integrating the surrogate model above the target (orange shaded area) results 
in the regularized El acquisition function (magenta and cyan). Using a non-stationary target with a constant 
prior mean (left) or a fixed target with a non-stationary prior mean (right) lead to indistinguishable acquisition 
functions, which decay at infinity. 


Precisely speaking, these two views are not exactly equivalent. Starting with the po prior mean, the 
posterior mean yields an additional term 

k(x)T(K + a2l)-ie(X), (8) 

where [^(X)]i = ^(xi). This term is negligible when the test point x is far from the data X due 
to our exponentially vanishing kernel, making the two alternate views virtually indistinguishable in 
Figure [2 However, with this new formulation we can apply the same regularization to any policy 
which uses a Gaussian process, e.g. Thompson sampling or entropy search. 

3.2.2 Choice of regularization 

By inspecting Equation Q, it is easy to see that any coercive prior mean function would lead to an 
asymptotically vanishing El acquisition function; in this work however we consider quadratic (Q) 
and isotropic hinge-quadratic (H) regularizing prior mean functions, defined as follows (excluding 
the constant bias b) 


^q(x) = (x - x)'^diag(w2) ^(x - x), 

(9) 

err(x) = I[||x - x|b > 

(10) 


Both of these regularizers are parameterized by d location parameters x, and while has an ad¬ 
ditional d width parameters w, the isotropic regularizer has a single radius R and a single f3 
parameter, which controls the curvature of outside the ball of radius i?; in what follows we fix 
13 = 1 . 

Fixed prior mean hyperparameters. We are left with the choice of centre x and radius parame¬ 
ters R (or widths w). Unlike the bias term b, these parameters of the regularizer are not intended to 
allow the surrogate to better fit the observations In fact, using the marginal likelihood to esti¬ 
mate or marginalize ijj = {x, R, w}, could lead to fitting the regularizer to a local mode which could 
trap the algorithm in a suboptimal well. Eor this reason, we use an initial, temporary user-defined 
bounding box to set ip at the beginning of the run; the value of ip remains fixed in all subsequent 
iterations. 
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Figure 2; Visualization of selections x„ made by the EI-H algorithm on two toy objective functions consisting 
of three Gaussian modes in one (left) and two (right) dimensions. Grey lines delimit the initial bounding box; 
grey square markers indicate the initial latin hypercube points, while the orange and blue points distinguish 
between the first and second half of the evaluation budget of 30d, respectively. In the two-dimensional example, 
the height of the Gaussians are indicated by -l-l, +2, and -1-3. 


3.3 Visualization 

Before describing our experimental results, Figure [^provides a visualization of El with the hinge- 
quadratic prior mean, optimizing two toy problems in one and two dimensions. The objective func¬ 
tions consist of three Gaussian modes of varying heights and the initial bounding box does not 
include the optimum. We draw attention to the way the space is gradually explored outward from 
the initial bounding box. 


4 Experiments 

In this section, we evaluate our proposed methods and show that they achieve the desirable behaviour 
on two synthetic benchmarking functions, and a simple task of tuning the stochastic gradient descent 
and regularization parameters used in training a multi-layered perceptron (MLP) and a convolutional 
neural network (CNN) on the MNIST dataset. 

Experimental protocol. For every test problem of dimension d and every algorithm, the opti¬ 
mization was run with an overall evaluation budget of 30d including an initial 3d points sampled 
according to a latin hypercube sampling scheme (as suggested in ifTOll ). Throughout each particular 
run, at every iteration n we record the value of the best observation up to n and report these in 
Figure]^ 

Algorithms. We compared the two different methods from Sectionj^to the standard El with a fixed 
bounding box. Common random seeds were used for all methods in order to reduce confounding 
noise. All algorithms were implemented in the pybo framework available on githul:|^ ||2l, and are 
labelled in our plots as follows: 

El: Vanilla expected improvement with hyperparameter marginalization. 

EI-V: Expected improvement with the search volume doubled every 3d iterations. 

EI-H/Q: Regularized El with a hinge-quadratic and quadratic prior mean, respectively. 

Note that for the regularized methods EI-H/Q, the initial bounding box is only used to fix the location 
and scale of the regularizers, and to sample initial query points. In particular, both regularizers are 
centred around the box centre; for the quadratic regularizer the width of the box in each direction 
is used to fix w, whereas for the hinge-quadratic R is set to the box circumradius. Once these 
parameters are fixed, the bounding box is no longer relevant. 


^ https://github.com/mwhoffman/pybo 
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Figure 3: Best observations as optimization progresses. Plotting mean and standard error over 40 (Hartmann), 
14 (MLP), and 19 (CNN) repetitions. 


4.1 Synthetic benchmarks: Hartmann 3 and 6 

The Hartmann 3 and 6 functions (numbers refer to their dimensionality) are standard, synthetic 
global optimization benchmarking test functions. In a separate experiment, indicated by an ap¬ 
pended asterisk (*), we consider a randomly located bounding box of side length 0.2 within the unit 
hypercube. This window is unlikely to include the global optimum, especially in the six-dimensional 
problem, which makes this a good problem to test whether our proposed methods are capable of use¬ 
ful exploration outside the initial bounding box. 

4.2 MLP and CNN on MNIST 

The MNIST hand-written digit recognition dataset is a very common simple task for testing neural 
network methods and architectures. In this work we optimize the parameters of the stochastic gra¬ 
dient descent (SGD) algorithm used in training the weights of the neural network. In particular, we 
consider an MLP with 2048 hidden units with tanh non-linearities, and a CNN with two convolu¬ 
tional layers. These examples were taken from the official GitHub repository of torch? demo^ 
The code written for this work can be readily extended to any other demo in the repository or any 
torch script. 

For this problem, we optimize four parameters, namely the learning rate and momentum of the SGD 
optimizer, and the Li and L 2 regularization coefficients. The parameters were optimized in log 
space (base e) with an initial bounding box of [—3, —1] x [—3, —1] x [—3,1] x [—3,1], respectively. 
For each parameter setting, a black-box function evaluation corresponds to training the network 
for one epoch and returning the test set accuracy. To be clear, the goal of this experiment is not 
to achieve state-of-the-art for this classification task but instead to demonstrate that our proposed 
algorithms can find optima well outside their initial bounding boxes. 

4.3 Results 

Figure 1^ shows that for the Hartmann tests, the Bayesian optimization approaches proposed in this 
paper work well. The results confirm our hypothesis that the proposed methods are capable of useful 
exploration outside the initial bounding box. We note that when using the entire unit hypercube as 
the initial box, all the Bayesian optimization techniques exhibit similar performance as in this case 
the optimum is within the box. The Hartmann tests also show that our volume doubling heuristic is 

"'https://github.com/torch/demos 
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Figure 4; Pairwise scatterplot of selections (upper triangle) and recommendations (lower triangle) for the 
MLP-MNIST experiment. For example, the second plot of the first row corresponds to a scatter plot of the 
selected learning rates vs. momentum parameters/or a single seed. In contrast, the first plot of the second row 
corresponds to a scatter plot of the recommended learning rates and momentum parameters over all 14 runs. 
The initial hounding hox and sample points (for this particular run) are shown as a hlack rectangle and hlack 
square dots, respectively. All other points respect the color scheme of Figurej^ (L 2 cropped out for space.) 


a good baseline method. Although it is less effective than EI-H as the dimensionality increases, it is 
nonetheless an improvement over standard El in all cases. 

The MNIST experiment shows good performance from all three methods EI-{V,H,Q}, particularly 
from the hinge-quadratic regularized algorithm. Indeed, when compared to the standard El, EI-H 
boasts a 20% improvement in accuracy on the MLP and almost 10% on the CNN. 

We believe that EI-H performs particularly well in settings where a small initial bounding box is 
prescribed because the hinge-quadratic regularizer allows the algorithm to explore outward more 
quickly. In contrast, EI-Q performs better when the optimum is included in the initial box; we 
suspect that this is due to the fact that the regularizer avoids selecting boundary and corner points, 
which El and El-V tend to do, as can be seen in Eigure Indeed, the green dots (El-V) follow the 
corners of the growing bounding box. In contrast, EI-H/Q do not exhibit this artefact. 

5 Conclusion and future work 

In this work, we propose a versatile new approach to Bayesian optimization which is not limited to a 
search within a bounding box. Indeed, given an initial bounding box that does not include the opti¬ 
mum, we have demonstrated that our approach can expand its region of interest and achieve greater 
function values. Our method fits seamlessly within the current Bayesian optimization framework, 
and can be readily used with any acquisition function which is induced by a GP. 

Einally, this could have implications in a distributed Bayesian optimization scheme whereby mul¬ 
tiple Bayesian optimization processes are launched, each responsible for a neighbourhood of the 
domain. Interactions between these processes could minimize overlaps or alternatively ensure a 
certain level of redundancy, which would be helpful in a noisy setting. 

We emphasize that in this work we have addressed one of the challenges that must be overcome 
toward the development of practical Bayesian optimization hyper-parameter tuning tools. A full so¬ 
lution, however, must also address the issues of dimensionality, non-stationarity, and early stopping. 
Eortunately, there has been great recent progess along these directions, and the methods proposed in 
this paper provide a complementary and essential piece of the puzzle. 
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